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Abstract 

Given i.i.d samples from some unknown continuous density on hyper-rectangle 
[0, O'*, we attempt to learn a piecewise constant function that approximates this un¬ 
derlying density non-parametrically. Our density estimate is defined on a binary 
split of [0,1]“^ and built up sequentially according to discrepancy criteria; the key 
ingredient is to control the discrepancy adaptively in each sub-rectangle to achieve 
overall bound. We prove that the estimate, even though simple as it appears, pre¬ 
serves most of the estimation power. By exploiting its structure, it can be directly 
applied to some important pattern recognition tasks such as mode seeking and den¬ 
sity landscape exploration. We demonstrate its applicability through simulations 
and examples. 


1 Introduction 

An important machine learning task is to efficiently summarize large-scale high¬ 
dimensional data into compact form at multiple resolutions. Since these data are typ¬ 
ically sampled from multi-modal distributions, a natural choice would be using non- 
parametric density estimation methods. Classic empirical distribution (ED) and kernel 
density estimation (KDE) play an important role in nonparametric density estimation. 
Besides their long noticed drawbacks (e.g., ED is non-continuous; KDE is sensitive 
to the choice of bandwidth and scales poorly in high dimensions), they are not good 
summarization tools in dealing with data with high dimension and large size, e.g., 
evaluating them involves each data point and their functional forms provide little direct 
information of the “landscape” of the distribution. 

In this paper, we consider domain partition based approach for density estimation. 
The use of domain partition dates back to histogram, which is still an ubiquitous tool 
in data analysis today; however, its non-scalability in high dimensions limits its ap¬ 
plications. Motivated by the usefulness of histogram and the attempts to adapt it for 
multivariate cases, we propose a novel nonparametric density estimation method. In 
order to approximate distributions with continuous densities, the functional class of 
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Figure 1: Left; a sequence of binary partition and the corresponding tree representa¬ 
tion; if we encode partitioning information (e.g., the location where the split occurs) 
in the nodes, the mapping is one-to-one. Right; the gaps with m = 3, we split the 
rectangle at location D, which corresponds to the largest gap, if it does not satisfy 


.rl. 


densities we adopt is guided by two principles; simple and rich. Simple means that 
the functions in the class have concise forms and are cheap to evaluate; moreover, they 
enjoy nice structures. Rich means that any continuous density can be approximated by 
functions in the class at any accuracy. 

We choose piecewise constant functions supported on binary partitions (Section 
2), which is a more scalable and adaptive partitioning scheme as opposed to mesh used 
by histogram. Most importantly, this class satisfies both requirements; 1) functions in 
the class are defined by the underlying partitions of the domain and can be displayed 
by binary trees; consequently, trees provide a hierarchical summary of the distribution 
and can reveal its landscape in “multi-resolution”; 2) it is well known in calculus that 
any continuous function can be approximated by piecewise constant functions. 

Since the distributions conditioned on each sub-rectangle are uniform for piecewise 
constant densities, we construct the density estimator based on discrepancy criteria. 
We show that, in rather general settings, our estimated density, simple as it appears, 
preserves most of the estimation power, i.e., controls the integration error for the family 
of functions with finite total variation and finite variance, under the same convergence 
rate as Monte Carlo methods. Our algorithm, by exploiting the sequential build-up of 
binary partition as shown in Figure [T] can find the density efficiently. It is also worth 
noting that the family of functions with finite total variation and finite variance is rather 
broad and is sufficient for practical use. 

In summary, we highlight our contributions as follows; 

• To our knowledge, this is the first error analysis on binary partition based den¬ 
sity estimation, which interconnects the study of Quasi-Monte Carlo and density 
estimation. 

• We establish an error bound of the estimator, which is optimal in the 

sense of Monte Carlo integration. Simulations support the tightness. 

• Our method is a general data summarization tool and is readily applicable to 
important learning tasks such as mode seeking and level-set tree construction. 
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2 Density Estimation via Discrepancy 


Let 17 be a hyper-rectangle in A binary partition S on 17 is a collection of sub¬ 
rectangles whose union is 17. Starting with Bi = {17} at level 1 and Bt = {17i,17t} 
at level t, Bt+i is produced by dividing one of regions in Bt along one of its coordinates, 
then combining both sub-rectangles with the rest of regions in Bt', continuing with this 
fashion, one can generate any binary partition at any level (Figure [T]). 

At each stage of sequential built-up of binary partition, to decide whether the sub¬ 
rectangle deserves further partitioning, we need to check whether the points in it are 
“relative” uniformly scattered. Discrepancy, which is widely used in the analysis of 
Quasi-Monte Carlo methods, is a set of criteria to measure the uniformity of points 
in [0, l]'^. The precise definitions of the discrepancy and the variation depend on the 
space of integrands. The classic star discrepancy, which is used to bound the error of 
Quasi-Monte Carlo integration, is defined as. 

Definition 2.1. The star discrepancy of xi ,..., Xn G [0,1]^^ is 




^ n d 

sup ^ ^ lxjG[0,a) 


( 1 ) 


The error bound is the famous Koksma-Hlawka inequality and the proof is included 

inini. 

Theorem 1. (Koksma-Hlawka inequality). Let xi, X 2 , ■■■, Xn G [0, l]'^ and f be defined 
on [0, l]'^, then 
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where s = {1, ...,d} and (/) is the total variation in the sense of Hardy and 

Krause, e.g., for any hyper-rectangle [a, 5], if all the involved partial derivatives of f 
are continuous on [a, b], then 
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We split the sub-rectangle when the discrepancy of points in it is larger than some 
threshold value. In order to find a good location to split for [a, 6] = 11^=1 
we divide jth dimension into m bins [oj, Qj -\- (bj — aj)/m],[oj (bj — aj){m — 
2) /m, Oj {bj — Oj) (m — 1) / m] and keep track of the gaps at aj -f (bj — aj) jm, ...,aj-\- 
{bj — aj){m — l)/m, where the gap pjk is defined as |(l/n)X]r=i + 

(bj — af)k/m) — k/m\ for k = l,...,(m — 1), there are in total [m — l)d gaps 
recorded (Figure [^. [a, b] is split into two sub-rectangles along the dimension and 
location corresponding to maximum gap (Figure [^l. The complete algorithm is given 
in Algorithm[T] and is explained in detail in the following sections. 
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Algorithm 1 Density Estimation via Discrepancy 

Let P( ) define the points and Pr( ) define the probability mass in a hyper-rectangle re¬ 
spectively. W.L.G, we assume that = [0,1]“^ and P{il) = {xi = {xn,Xid),Xi G 
i i-d samples drawn from an underlying distribution. 

1: procedure DENSlTY-ESTlMATOR(f2, P, m, 8) 

B = {[0,l]‘*},Pr([0,1]'^) = 1 

while true do 

for each ri = [ai, bi] in B do 

Calculate gaps {Qjk}j—i,...,d,k—i,...,m—i 

Scale P(ri) = }"ii to P = {xi- = ( ^ 






if P{ri) 7^ 0 and D*^ (P) > ^ then > by Condition in Theorem^ 

t> These values can also be recorded to save confutation 
Divide into rn = [a^i, bn] and = [an, bn] along the max gap (FigurefU 

Pr(rii) = Pr(ri2) = Pr(ri) - Pr(rii) 

B' = B'U{r,i,ri2} ' 
else B' = B' U {rt} 
it B' B then B = B' 
else return B, Pr(-) 


Remark 2.1. Zero probability is not desirable in some applications; it can be avoided by 
adding pseudo count (Laplace smoother) a in line 11, i.e., Pr(rii) = Pr(ri) ■ 

Density d{ri) is recovered by Pr(ri)/ ). 

Remark 2.2. The binary tree shown in Pigure[T]can be constructed as a byproduct and 
the user can specify the deepest level to terminate the algorithm. 


The density, which is a piecewise constant function, is 

i 

fi{x) = ^^d{ri)l{x £ Vi} (3) 

i=l 


where 1 is the indicator function; {r^, d{ri)}\^^ is a list of pairs of sub-rectangles and 
corresponding densities. Since the number of sub-regions is far less than data size, the 
partition is a concise representation of the data; in Experimental Results section, we 
demonstrate how f)[x) can be leveraged in various machine learning applications. 

Compared to histogram which has the same form (|^ but suffers from curse of di¬ 
mensionality, the rational behind our adaptive partition scheme is to avoid splitting the 
sub-rectangle where the data are relatively uniform. One classic results of histogram 
Qo) states that if for each sphere S centered at the origin 


—>■ 0 as n —>■ 0, lim 

n—>-oo 


IK :rrng7 0}| 

n 


= 0 


then 

lim E||p(a:) - 33„(a;)||i = 0, lim ||p(a;) - pn(a;)||i = 0, a.s. 

n—>-oo n—^oo 

where is the bandwidth of histogram, {r"} is the histogram bins at sample size n 
and Pn{x) is the density estimation from {r"} by 

The key tool in proving its convergence is Lebesgue Density Theorem. However, 
our method cannot guarantee the size of each sub-rectangle shrinks to 0, which causes 
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the technical difficulty in proving its consistency. Instead, we establish a weaker con¬ 
vergence result in the following section and leave the pointwise convergence as an open 
problem. 


3 Theoretical Results 


To establish our main theorem, we need the following three lemmas. Lemma 3.1 and 
is trivial to show by if / is smooth enough. The complete proofs of Lemma 3.1 


3.2 


and 3.2 can be found in IITtI : Lemma [33]is quite technical and proved in ifTTl . 


Lemma 3.1. Let f be defined on the hyper-rectangle [o, 5]. Let {[ai^bi] : 1 < i < 
m < cx)} be a split of [a, b]. Then 




Lemma 3.2. Let f be defined on the hyper-rectangle [a, b]. Let f{x) be defined on the 
hyper-rectangle [a, b] by f{x) = f{x) where Xi = 4>i{xi) with (j)i is a strictly monotone 
(increasing or decreasing) invertible function from [a^, bi] onto [a^, bi], then 


Lemma 3.3. LetD^j^ = inf^-j^ £)* (a:i,..., Xn), we have 

Dn,d < 

for all n,d = 1, 2,..., with a multiplicative constant c. 

Remark 3.1. It is also shown that c < 10 in im. The asymptotic behavior of the star 
discrepancy on n is much better (e.g., Halton sequence ifT^ has = 0((\ogn)^/n)); 
but it does not necessarily mean that the uniform bound which is valid for all d and n 
cannot be of order 


Theorem 2. / is defined on d—dimensional hyper-rectangle [a, b] and P = {xi,...,a;„ S 
[a, 6]}. Then we have 


1 / 

' “'[a,6] 


f{x)dx - 




E /(^Ol < n(6* - ai)D:{P)V^^)^\f) (4) 


where P = {i, = 

We are ready to state our main theorem, 

, fn 1 1 ^ 

Theorem 3. / is defined on hyper-rectangle [0,1]“ with (/) < oo and the sub¬ 

rectangles {[ai,bi]}\^i are a split of [0,1]‘^. Let Xi,X n G [0^1]'^ be an Li.d sample 
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set drawn from distribution p{x) defined on [0, l]'^ and Pi = {xn,Xim, Ui G N“*"} 
are points in each sub-region. Consider a piecewise constant density estimator 

i 

p{x) = ^ dil{x G \ai,bi]} 

i=l 


where di = ~ ^ij)) ^nijN, i.e., the empirical probability. In each sub- 

region [ai,bi]. Pi satisfies 

DlfiP,) < (5) 


where cni 
Then 


c ^ positive constant; Pi is defined as {xj = ( 

N 

f f{x)p{x)dx- ^^fixi)\ < if) 

-'[0,1]'^ '''' ' VJV 


’ bid )jj=f 
(6) 


Remark 3.2. ai controls the “relative” uniformity of the points and is adapted for Pi, 
i.e., it imposes more restricted constraint on the region containing large proportion of 
the sample (rii/N). 

Remark 3.3. In Monte Carlo methods, the convergence rate of ^ f{^i) to /jq f{x)p{x)dx 

is of order 0{1/s/N) as long as variance of f{x) under p(a;) is bounded; our density 
estimate is optimal in the sense that it achieves the same rate of convergence. 

Remark 3.4. There are many other p{x) satisfying (|^ such as the empirical distribution 
in the extreme or kernel density estimation with sufficiently small bandwidth. Our 
density estimation is attractive in the sense that it provides a very sparse summary of 
the data but still captures the landscape of the underlying distribution; moreover, the 
piecewise constant function does not suffer from many local bumps as kernel density 
estimation does. 


Corollary 3.1. For any hyper-rectangle A = [a, b] C (0,1)"^. Let P{A) = J^p{x)dx 
and P{A) = f^p(x)dx, then |P(A) — P{A)\ converges to 0 at order 0{\/\fN) 
uniformly. 


Remark 3.5. The total variation distance between probability measures P and P is 
dehned via S{P,P) = sup 4 eB \P(A) — T’(^)|, where B is the Borel a algebra of 
[0, l]'^; in contrast. Corollary 3.1 restricts A to be rectangles. 


4 Computational Aspects 

There are no explicit formulas for calculating D^{xi,..., Xn) and Z?* ^ except for low 
dimensions. If we replace ai in Q and apply Lemma [33] we actually intend to con¬ 
trol Dl{Pi)hy Bs/N/n,. There are several ways to approximate D^{xi ,..., x„): 1) E. 
Thiemard presents an algorithm to compute the star discrepancy within a user specihed 
error by partitioning the unit cube into subintervals EOjlZIlIT]; 2) A genetic algorithm 
to calculate the lower bounds is proposed in m and is used in our experiments; 3) 
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A new randomized algorithm based on threshold accepting is developed in ||9|- Com¬ 
prehensive numerical tests indicate that it improves on other algorithms, especially in 
higher dimension 20 < d < 50. The interested readers are referred to the original 
articles for implementation details. 

In dealing with large data, several simple observations can be exploited to save 
computation: 1) it is trivial that maxj=i ,...d Let 

be the ith smallest element in then ^ ~ 

^1 0. which has complexity 0(n log n). Hence max^^i -D* can be 

used to compare against O'/NIn first before calculating Z3* 2) O'/Njn is 

large when n is small, but is bounded above by 1; 3) 6^fNjn is tiny 

when n is large and Dl^{{xi}f^i) bounded below by Cd n~^ with some 

constant Cd depending on rf®; thus we can keep splitting without checking 0 when 
6\/N jn < e, where e is a small positive constant (say 0.001) specified by the user. 

This strategy may introduce few more sub-rectangles, but the running time gain is 
considerable. 

Another approximation works well in practice is by replacing star discrepancy with 
computationally attractive £2 star discrepancy, i.e., (xi,..., x„) = | ^ YTi=\ la;iG[o,a)- 

in fact, several statistics to test uniformity hypothesis based on D^n'^ 
are proposed in IH; however, the theoretical guarantee in Theorem]^ is no longer 
valid. By Warnock’s formula I®, 

^ 2l-ii " d ^ n d 

^ ^ 12 n min{l - Xifc, 1 - Xjk} 

i = l k = l = l k = l 

Dn^ can be computed in 0(n log'^”^ n) by K. Frank and S. Heinrich’s algorithm ®. 

At each scan of iB in Algorithmj^ the total complexity is at most 0{ni log'^”^ rii) < 

SLi 0{ni log^'”^ n) < 0(n n). 

5 Experimental Results 

5.1 Simulations and Comparisons 

1) To demonstrate the methods and visualize the results, we simulate our algorithm 
through 3 2-dimensional data sets generated from 3 distributions respectively, i.e., 

X - AA(^,E)1{x e [0,l]2}with^ = (.50,.50)^,S = [0.08,0.02; 0.02,0.02]; 

X ~ (l/2)AA(/ii,Ei)l{x e [0,1]2} + (l/2)AA(/i2,S2)l{x e [0, l]^}, with/n = 

(.50, .25)^, El = [0.04,0.01; 0.01, 0.01] and /ra = (.50, .75)"^, E 2 = [0.04, 0.01; 0.01,0.01]; 

X ~ (l/3)^(2,5)^(5,2) -f (l/3)/3(4,2)^(2,4) -f (l/3)/3(l, 3)/3(3,1); where M is the 
Gaussian distribution and /3 is the beta distribution. The size of each data set is 10,000. 

As shown in the first row of Figure [^ we draw the partitions on 2D and render the 
estimated densities with a color map; the corresponding contours of true densities are 
embedded for comparison purpose. 

2) To evaluate the theoretical bound (|^, we choose 3 simple reference functions with 

dimension d = 2, 5 and 10 respectively, i.e., /i(x) = f 2 ix) = 


1 



Er=i Ei=i M^) = (EEi Ei=i and samples are generated fromp(a;) = 

5(11^1 ^i5.5(a;i)+nf=i /35,i5(a;i))- Theerror I /jQ fi{x)p{x)dx-J^^ -^^^ Mx)p{x)dx\ 
is bounded by | fi{x)p{x)dx-^ Mxj)\ + \ Mx)p{x)dx-^J2'^^^ Mxj)\ 

where p{x) is the estimated density; the first term is controlled by which is 

well known in Monte Carlo methods and the second term is controlled by (|^, thus 
the error is of order By varying the data size, the relative error vs. sample 

size is plotted on log-log scale for each dimension in the second row of Figure]^ their 
standard errors are obtained through generating 10 replicas under same distributions. 
Interestingly, the linear pattern shows up as expected. 



Figure 2: First row: simulation on 2D data with 3 different densities; the modes are 
marked by stars. Second row; simulation on 2, 5 and 10 dimensional data (from left to 
right) with sample functions fi, f 2 , f 3 - 

3) There are other two classes of domain partition based density estimators: I) Op¬ 
tional Polya Tree (OPT) ll22l is a class of conjugate nonparametric priors based on 
binary partition and optional stopping; II) Bayesian Sequential Partitioning (BSP) oa 
is introduced as a computationally more attractive alternative to OPT and simulations 
show that the density constructed by BSP is very close to MAP of OPT. However, the 
density estimate of OPT or BSP is obtained by sampling the posterior and the finite- 
sample error bound is not provided, while ours is constructed from a frequentist per¬ 
spective and we establish a theoretical framework for error analysis. The recursion in 
OPT has exponential complexity and BSP in principle searches among the exponential 
number of possible partitions, whereas our partitioning scheme is greedy and results 
in significant speedup. As to binary partition, we no longer restrict the algorithm to 
split the hyper-rectangle evenly (in the middle); by introducing the “gap”, we do the 
partitioning more adaptive to the data. 

To show the efficiency and scalability of our method, we compare it with KDE, 
OPT and BSP in terms of estimation error and running time. We simulate samples 
from X ^ 7riA/i(/ii, E))l{a; G [0, l]'^} with d = {2,3,4, 5, 6} and n = 

{10^, 10"*^, 10^} respectively (refer to Appendix A for detailed experiment settings). 

The estimation error and running time are summarized in Table [T] and Table [^re¬ 
spectively, the standard error is obtained by generating 20 replicas for each (d, n) pair. 
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Error(n 

= 10=^) 



EiTor(n 

= lob 



Error(n 

= lob 



KDE 

OPT 

BSP 

ours 

KDE 

OPT 

BSP 

ours 

KDE 

OPT 

BSP 

ours 

d 













2 

0.2331 

0.2442 

0.2533 

0.2634 

0.1604 

0.1637 

0.1622 

0.1603 

0.0305 

0.0376 

0.0308 

0.0312 


(0.0221) 

(0.0211) 

(0.0163) 

(0.0207) 

(0.0102) 

(0.0101) 

(0.0113) 

(0.0132) 

(0.0021) 

(0.0021) 

(0.0025) 

(0.0027) 

3 

0.2893 

0.2751 

0.2683 

0.2672 

0.2163 

0.1722 

0.1717 

0.1721 

0.0866 

0.0460 

0.0477 

0.0452 


(0.0227) 

(0.0212) 

(0.0233) 

(0.0265) 

(0.0199) 

(0.0222) 

(0.0183) 

(0.0143) 

(0.0047) 

(0.0049) 

(0.0043) 

(0.0034) 

4 

0.3313 

0.2807 

0.2872 

0.2855 

0.2466 

0.1832 

0.1882 

0.1955 

0.1600 

* 

0.0629 

0.0637 


(0.0225) 

(0.0232) 

(0.0217) 

(0.0191) 

(0.0113) 

(0.0185) 

(0.0147) 

(0.0185) 

(0.0057) 

* 

(0.0061) 

(0.0059) 

5 

0.3522 

0.3001 

0.3035 

0.2907 

0.2599 

0.1911 

0.1987 

0.1963 

0.1817 

* 

0.0716 

0.0721 


(0.0317) 

(0.0299) 

(0.0319) 

(0.0302) 

(0.0199) 

(0.0143) 

(0.0122) 

(0.0131) 

(0.0088) 

* 

(0.031) 

(0.0066) 

6 

0.4011 

0.3512 

0.3515 

0.3527 

0.2833 

* 

0.2093 

0.2011 

0.1697 

* 

* 

0.0809 


(0.0318) 

(0.0307) 

(0.0354) 

(0.381) 

(0.0255) 

* 

(0.0166) 

(0.0137) 

(0.0122) 

* 

* 

(0.0071) 


Table 1: The error in Hellinger Distance between KDE, OPT, BSP, our method and 
the true density for each pair {d, n) respectively. Stars indicate that the running time 
exceeds 3600s. The numbers in parentheses are standard errors of 20 replicas. 



Runing Time(n = 

lOb 

Runing Time(n = 10“^) 

Runing Time(n = 

10b 


OPT 

BSP 

ours 

OPT 

BSP 

ours 

OPT 

BSP 

ours 

d 


2 

0.4(0.0) 

1.2(0.1) 

1.7(0.1) 

2.8(0.1) 

23.2(6.4) 

11.2(0.9) 

42.9(0.3) 

263.1(44.9) 

95.8(3.6) 

3 

0.8(0.0) 

1.6(0.3) 

2.2(0.4) 

13.3(1.1) 

27.7(8.4) 

17.1(1.9) 

252(2.8) 

422.8(91.7) 

143.7(4.3) 

4 

1.7(0.1) 

3.5(0.2) 

3.3(0.8) 

137.7(10.2) 

42.3(5.3) 

22.6(2.0) 

* 

684.3(80.2) 

192.4(5.1) 

5 

75.6(3.3) 

4.9(0.3) 

3.2(0.7) 

1731.7(17.7) 

138.2(9.7) 

21.3(2.2) 

* 

1547.9(155.6) 

231.6(6.8) 

6 

251.3(7.9) 

5.1(0.4) 

3.8(0.7) 

* 

179.1(13.4) 

30.0(2.1) 

* 

* 

285.4(10.2) 


Table 2: The average running time in seconds of OPT, BSP and our method for each 
pair (d, n) respectively. Stars indicate that the running time exceeds 3600s. The num¬ 
bers in parentheses are standard errors of 20 replicas. OPT and BSP are implemented in 
C++ and our method is in Matlab; it is noticed that the latency of Matlab dominates 
in small simulations. 

5.2 Mode Detection 

A direct application of the piecewise constant density is to detect modes a, i.e., the 
dense areas or local maxima on the domain. The modes of our density estimator is 
defined as 

Definition 5.1. A mode of the piecewise constant density is a sub-rectangle in the 
partition that its density is largest among all its neighbors as indicated by the stars in 
Figure]^ 

Flow cytometry allows measuring simultaneously multiple characteristics of a large 
number of cells and is a ubiquitous and indispensable technology in medical settings. 
One effort of current research is to identify homogeneous sub-populations of cells au¬ 
tomatically instead of manual gating, which is criticized for its subjectivity and non¬ 
scalability. There are a large amount of recent literatures concerning on auto gating 
and clustering, see 111 and many reference therein. 

In order to apply our method, we regard each cell as one observation in the sample 
space, i.e., if there are n markers attached to a single cell, then the whole data set 
is generated from a hypothetical n dimensional distribution. Mature cell populations 
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CDllty-' ^Dllbt 

(Ijmjfioidi (rnyMoid) 

3,4,5,4.11.121 (17.59,10.13) 


’'^‘-^‘^(Bctlls) (TcJlsI (2 .‘)!i0*7.: 




(B) 



55C-A 

CDllb 

B220^ 

TCR-b 

CD4 

CDS^ 

c-kit 

bca-a 

1 

1.8770 

-0.0157 

0.0057 

2.7186 

3.7947 

0.0297 

-0.2517 

1.9123 


2.2553 

2.3163 

1.7250 

0.0935 

0.0002 

-0.0057 

-0.0863 

0.2609 


2,4440 

0.3126 

3.8834 

0.2021 

3.5156 

0.1466 

-0.1615 

3.0241 

4 

1.7480 

0.3792 

4.2507 

0.0933 

-0.0556 

0.0313 

-0.0929 

2.1578 

b 

1.8593 

0.6694 

3.3148 

0.0301 

-0.2614 

0.0166 

-0.0114 

0.0368 

b 

1,6578 

0.1112 

0.0639 

0.0868 

-0.0315 

-0.0532 

-0.0055 

0.1259 

! 

3.0834 

3.1851 

0.0845 

0.3675 

1.0288 

0.0249 

0.6448 

0.4222 

a 

2.5367 

2.8891 

0.0880 

0.3243 

0.0729 

-0.0041 

-0.0932 

0.4023 

y 

3.3981 

2.5899 

2.4478 

2.9578 

2.6630 

1.3210 

2.8134 

3.4627 

lU 

2.9377 

2.0063 

1.8881 

2.5242 

2.0150 

0.5587 

2.1503 

2.9074 

11 

2.2758 

0.1515 

0.2992 

2.5438 

0.0466 

0.0803 

-0.0044 

0.4341 

12 

2.0420 

-0.0026 

-0.2342 

2.8619 

0.0970 

2.9129 

-0.0120 

0.5958 

18 

2.3562 

3.0953 

0.1961 

2.2922 

0.2796 

0.0253 

0.2475 

0.7455 


(C) 


Figure 3; Flow Cytometry. (A): an illustrative gating sequence, the cell type in each 
gate is attached; (B) there are 13 modes detected by our algorithm, and we arrange these 
modes into a hierarchical dendrogram: at first level, they are grouped by expression 
levels of CDl lb; subsequently, the CDl lb- modes are grouped according to B220 and 
TCR-b then further splitted according to CD4 and CDS on the next level; the CDl Ib-t 
modes are grouped by B220 then by TCR-b; (C) the details of the expression levels of 
each mode. 


concentrate in some high density areas, which can be easily identified in the binary 


partitioned space by Definition 5.1 


One practical issue needs to be addressed for most of the Cytometry analysis tech¬ 
niques: there is asymmetry in sub-populations; by optimizing a predefined loss func¬ 
tion, it is possible that some sparse yet crucial populations are overlooked if the al¬ 
gorithms take most of the efforts to control the loss in denser areas. A remedy for 
this issue is to perform a down-sampling IIIIIISI step to roughly equalize the densities 
among populations then up-sampling after populations are identified; however, this step 
is dangerous that it may fails to sample enough cells in sparse populations, as a result, 
these populations are lost in the down-sampled data. In contrast, our approach does 
not require down-sampling step, and the asymmetry among populations is captured by 
their densities. 

For the mouse bone marrow data studied in ITS), we choose the 8 markers (SSA-C, 
CDl lb, B220, TCR-/3, CD4, CDS, c-kit, Sca-1) that are relevant to the cell types of 
interests; the number of cells is '-^380,000 after removing mutli-cell aggregates and 
co-incident events. 13 sub-populations are identified by our algorithm ( ifTSl and its 
supplementary materials), the results are summarized in Figure]^ 


5.3 Density Topology Exploration and Visualization 

Level set tree a.k.a. connectivity graph (DG), is widely used to represent energy 
landscapes of systems. It summarizes the hierarchy among various local maxima and 
minima in the configuration space. Its topology is a tree and each inner node on the 
tree is a changing point that merges two or more independent regions in the domain. 
With the density estimation at hand, one may construct DG for samples instead of a 
given energy or density function. Unlike KDE that suffers from many local bumps 
and results in an overly complicated DG, ([^ is well suited for this purpose, partially 
because it smoothes out the minor fluctuations and takes only limited number of values; 
moreover, the simple structure of 0 makes the construction of such graph easy (i.e.. 
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one can just scan through each in decreasing order of d{ri), the complete algorithm 
is given in Appendix [^l. The DG of ([^ not only reveals the modes of the density on 
its leaves, it also provides a tool to visualize high dimensional data hierarchically; for 
example, in fiber tractography ca, DG is used to visualize and analyze topography in 
fiber streamlines interactively. 

We demonstrate that how our piecewise density function can be used to construct 
level set trees in Figure The basic pipeline is to scan sub-rectangles sequentially 
according to the decreasing order of their densities and agglomerate the sub-rectangles 
according to their adjacency. 



Figure 4: Level Set Tree. Left: the samples are generated from a Gaussian Mixture 
with 4 modes. Right: the level set tree. The clusters are annotated by A, B, C, D and 
colors are the levels. 


6 Conclusion and Future Work 


We have developed a nonparametric density estimation framework based on discrep¬ 
ancy criteria, proved its theoretical properties and shown that it is applicable to differ¬ 
ent types of problems. We point out several future research directions of interest: 1) 
Current approach deals with continuous features, but how to extend our theories and 
algorithm to handle (unordered) categorical data? 2) Coordinate-wise partition lim¬ 
its the approximation capability, recent progress ||3 in Quasi Monte Carlo on simplex 
provides us a possible alternative to use more flexible partition schemes. 3) Theoret¬ 


ically, how to sharpen Corollary 3.1 in order to enlarge the class of Borel sets rather 
than rectangles? 4) A thorough comparison is necessary to understand the empirical 
differences between our method and OPT or BSP. 5) Our approach can be viewed as 
an unsupervised version of CART (Classification And Regression Trees), therefore, it 
would be interesting to explore the idea of ensemble methods in analogous to boosting 
and random forest 0. 
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A Proofs 


A.l Proof of Theorem 1^ 

Proof. Define /(i) = f{x), where x = apply Theoremj^to 

/(x), we have 


f ~ 1 

/ f{x)dx- - 


From Lemma 


3.2 


'[ 0 . 1 ] 

do.i] 




if) = (/); /[o.i]<i f(.S:)di = {Utlib^-a^)) ^ f{x)dx 


by change of variables and f{xi) = f{xi) by definition. Hence, 0 follows immedi¬ 
ately. □ 


A.2 Proof of Theorem |3] 

Proof. Apply Theorem|^to each [a^, 6^], f = 1,..., I, we have 

~ O'ij) 

i=l 


'[ai.hi] 


f{x)dx - 


(7) 


i=i 


and by triangular inequality, we have 

N I 


[ fix)pix)dx-^^f{x,) <^d, f f{x)dx-^^fix,j) 


ai,bi] ---■ j=i 

-1» 


(8) 


By the definition of di, diN = combine with Theorem^ (|^, 

0 and Lemma [X3I we have 


l d 


J2d^ I J{x)dx-—J2f{x^J) <J2d.Ylihj-a.j)D:^{P^)Vl,^^^'\f) 


'[ai,bi] 


i=i 


-1^n\ if)^2^ r,A r U) 


i=l j=l 

W e 


N V nid c 


N V rijd c 


^ A] ^ JLylOdrrf) 

if) if) 


Apply Lemma [371 

□ 


A.3 Proof of Corollary [XT] 

Proof. In Monte Carlo methods, the convergence rate of P fixi) is of order 
O C^^j^ ). Let /(x) = I{x S [a, 6]} = I[a,b] be defined on [0,1]'', we have var(/) = 
P(A)(1 — P{A)) < 1/4; thus, this error is bounded uniformly. 
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If another indicator function / is defined on [d, h\ C (0,1)"^, then let 


bj - aj 


1 - 


~ l[0,aj) + (o-i + ~ _ (Xj Qj))I[5 • {,.) ++ r (Xj ^i))I[L,l] 

a-j bi — Qi I j’ 1 — bj ^ ^ ' 


and (j){x) = 11^=1 4’j{xj) and apply Lemma 3.2 we have (/) = (/); 

thus, the left term of (|^ is bounded uniformly. 

The error | f,{x)p{x)dx-J^i^^^-^a fi{x)p{x)dx\ is bounded by | fi{x)p{x)dx- 

i Mxj)\ + I /[o fi{x)p{x)dx - i YTj=i fi{xj)\. Combining the two parts, 
the theorem follows by triangular inequality. □ 




B Experiment Setting of Comparison with OPT and 
BSP 


The source codes are obtained from the authors. Their implementation language is 
C++; in contrast, our method is implemented in Matlab. For small data, the latency of 
Matlab dominates the computing time as shown in the first block of Table 


/ Ml \ 
M2 


/ 1/4 1/4 1/2 •• 

1/4 3/4 1/2 •• 

• 1/2 \ 

• 1/2 

M3 


3/4 1/4 1/2 •• 

• 1/2 

\ d4 J 


3/4 3/4 1/2 •• 

• 1/2 y 


and S = O.OII, i.e., the identity matrix, tt = (1/4,1/4,1/4,1/4). The system 
where the comparison is performed is Ubuntu 13.04, AMD64 8-core Opteron 2384 
(SB X6240, 0916FM400J); 31.42GB RAM, 10GB swap. 

C Level Set Tree Algorithm 

For a given partition P and the list of pairs of sub-regions and corresponding densities 
l+i, d{ri)}\_^ as in ([^, we build a graph G based on the adjacency of sub-regions and 
each sub-region is a node on the graph. The algorithm to determine the adjacency of 
sub-region i, j is; 

1: procedure is-ADJACENT(ri, Tj) 

2: Ck = (cfei, ...,cm): the center of rfc, A: e {ij} 

3: h = {Iki, ■■■, hd)' the width of in each dimension, k G {*, j} 

4: for fc 1,..., d do 

5: if \ci^ ^jk\ ^ 4“ then 

6: return False 

7: return True 

G is constructed by connecting adjacent sub-regions. When G has fc > 1 connected 
components {ci, C 2 ,..., Ck}, a “virtual region” r is added into G with density zero and it 
connects the region in each with lowest density in order to make G connected. 
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Starting with empty set So at step 0, the sub-region is added into S sequentially 
according to the decreasing order of densities. At {i — l)th step, we compute the 
connected components based on the induced sub-graph G'(5'i_i). Suppose Si-i = 
{gi,g 2 , ■■■,gj}, where is the connected components and j is the number of compo¬ 
nents; at step i, there are two scenarios when r is added into i) r is adjacent to 

multiple components then S'* = {r(i) U S*_i\U 

and r is the parent of latest added sub-region in each gi ; ii) r is disconnected with all 
the components, then Si = {Si_i, r} and r is a leaf. The complete description of the 
algorithm is: 

1: procedure sub-level-tree(P) 

2: Build G from P 

3: So = 0 

4: i?0 = 0 

5: for i ^ 1,..., ^ do 0 n is the number of nodes in G 

6: Si-i = {gi,g 2 i ■■■, gj} t> {5/c}fc=i are the connected components in 

G(S*_i) 

7: Ri-i = {fi, r 2 , ..., Tj} > Tk is the last sub-region added into 

gk,k = 

8 : if r(i) is adjacent to ,..., 5 ^^} then > r(i) has zth largest density in 

P 

9: -S'* = {r(*) U {p*J‘,=i,S*_i\ U {5*,}*=i} 

10: R^ = {r(*),i?i_i\ U 

11: p(fij) = > p Stores the parent of each sub-region 

12: Color(r(i)) = density 

13: else 

14: S* = {Si_i,r(*)} 

15. Ri jP*—X, r^*^} 

16: Color(r(x)) = density (r(x)) 

17: return p. Color 
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